[Model] support LTX-2 text-to-video image-to-video #841
david6666666 wants to merge 35 commits into vllm-project:main
Conversation
@ZJY0516 @SamitHuang @wtomin ptal, thx
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 346be1b2ba
```python
sp_size = getattr(self.od_config.parallel_config, "sequence_parallel_size", 1)
if sp_size > 1 and latent_length < sp_size:
    pad_len = sp_size - latent_length
    if latents is not None:
        pad_shape = list(latents.shape)
        pad_shape[2] = pad_len
        padding = torch.zeros(pad_shape, dtype=latents.dtype, device=latents.device)
        latents = torch.cat([latents, padding], dim=2)
    latent_length = sp_size
```
Pad audio latents for sequence-parallel sharding
When sequence_parallel_size > 1, the LTX2 transformer shards audio_hidden_states with SequenceParallelInput (auto-pad is off), so the sequence length must be evenly divisible across ranks. Here prepare_audio_latents only pads when latent_length < sp_size, but it does nothing when latent_length is larger yet not divisible (e.g., default 121 frames @ 24fps → latent_length≈126, sp_size=4). That yields uneven shards and will fail during all‑gather or produce mismatched audio in SP runs. Consider padding latent_length up to the next multiple of sp_size (or enabling auto‑pad in the SP plan) instead of only handling the < sp_size case.
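The suggested fix can be sketched as a small helper (the name `padded_length` is hypothetical, not part of the PR) that rounds the latent length up to the next multiple of `sp_size` rather than only handling the `< sp_size` case:

```python
def padded_length(latent_length: int, sp_size: int) -> int:
    """Round latent_length up to the next multiple of sp_size.

    Hypothetical helper illustrating the review suggestion; not the PR's code.
    """
    if sp_size <= 1:
        return latent_length
    return ((latent_length + sp_size - 1) // sp_size) * sp_size
```

The pipeline would then zero-pad the latents along dim=2 with `torch.cat`, as in the snippet above, up to `padded_length(latent_length, sp_size)` instead of up to `sp_size`.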
Pull request overview
This pull request adds comprehensive support for the LTX-2 (Lightricks) text-to-video and image-to-video models with integrated audio generation capabilities, aligning with the diffusers library implementation (PR #12915).
Changes:
- Implements LTX2 text-to-video and image-to-video pipelines with joint audio generation
- Adds LTX2VideoTransformer3DModel with audio-video cross-attention blocks
- Integrates cache-dit support for LTX2 transformer blocks
- Extends example scripts to handle audio output alongside video frames
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| vllm_omni/diffusion/models/ltx2/pipeline_ltx2.py | Core LTX2 text-to-video pipeline with audio generation support |
| vllm_omni/diffusion/models/ltx2/pipeline_ltx2_image2video.py | LTX2 image-to-video pipeline with conditioning mask and audio |
| vllm_omni/diffusion/models/ltx2/ltx2_transformer.py | Audio-visual transformer with a2v/v2a cross-attention blocks and RoPE |
| vllm_omni/diffusion/models/ltx2/__init__.py | Module exports for LTX2 components |
| vllm_omni/diffusion/registry.py | Registers LTX2 pipeline classes and post-processing functions |
| vllm_omni/diffusion/request.py | Adds audio_latents, frame_rate, output_type, and decode parameters |
| vllm_omni/diffusion/diffusion_engine.py | Extends engine to extract and route audio payloads from dict outputs |
| vllm_omni/entrypoints/omni_diffusion.py | Allows model_class_name override for custom pipeline selection |
| vllm_omni/entrypoints/async_omni_diffusion.py | Allows model_class_name override in async entrypoint |
| vllm_omni/diffusion/cache/cache_dit_backend.py | Adds cache-dit support for LTX2 transformer blocks |
| examples/offline_inference/text_to_video/text_to_video.py | Enhanced to handle LTX2 audio+video output and encode_video export |
| examples/offline_inference/text_to_video/text_to_video.md | Documents LTX2 usage example with frame_rate and audio_sample_rate |
| examples/offline_inference/image_to_video/image_to_video.py | Enhanced for LTX2 I2V with audio output and model class override |
Comments suppressed due to low confidence (1)
examples/offline_inference/text_to_video/text_to_video.py:100
- This assignment to 'parallel_config' is unnecessary as it is redefined before this value is used.
parallel_config = DiffusionParallelConfig(
```python
class LTX2ImageToVideoPipeline(LTX2Pipeline):
    support_image_input = True
```
The LTX2ImageToVideoPipeline should also inherit from SupportAudioOutput and declare support_audio_output = True. Although it inherits support_image_input from the pattern in the codebase, it also produces audio output like its parent LTX2Pipeline.
Both class variables should be declared:
- support_image_input = True (already present)
- support_audio_output = True (missing)
And the class should inherit from both protocols:
class LTX2ImageToVideoPipeline(LTX2Pipeline, SupportAudioOutput):
Note: Once LTX2Pipeline properly inherits from SupportAudioOutput, this class will inherit it automatically, but it's clearer to be explicit about all supported interfaces.
```diff
-class LTX2ImageToVideoPipeline(LTX2Pipeline):
-    support_image_input = True
+class LTX2ImageToVideoPipeline(LTX2Pipeline, SupportAudioOutput):
+    support_image_input = True
+    support_audio_output = True
```
```python
# Configure parallel settings (only SP is supported for Wan)
# Note: cfg_parallel and tensor_parallel are not implemented for Wan models
parallel_config = DiffusionParallelConfig(
    ulysses_degree=args.ulysses_degree,
    ring_degree=args.ring_degree,
)
```
The parallel_config is defined twice with identical content (lines 100-103 and lines 107-110). This is redundant code duplication. Remove one of these duplicate blocks.
The comment also mentions "only SP is supported for Wan" which may not be accurate for all models in this script (e.g., LTX2).
```python
num_inference_steps=args.num_inference_steps,
num_frames=args.num_frames,
frame_rate=frame_rate,
enable_cpu_offload=True,
```
The enable_cpu_offload parameter is hardcoded to True in the generate call, but it should respect the command-line argument args.enable_cpu_offload. This overrides the user's choice and always enables CPU offloading.
Change to: enable_cpu_offload=args.enable_cpu_offload,
```diff
-enable_cpu_offload=True,
+enable_cpu_offload=args.enable_cpu_offload,
```
```python
    return mu


class LTX2Pipeline(nn.Module):
```
The LTX2Pipeline class should inherit from SupportAudioOutput and declare support_audio_output = True as a class variable. This is necessary for the diffusion engine to properly identify that this pipeline produces audio output and handle it correctly.
The pattern is established in other audio-producing pipelines like StableAudioPipeline (see vllm_omni/diffusion/models/stable_audio/pipeline_stable_audio.py:61). Without this, the supports_audio_output() check in diffusion_engine.py:32-36 will return False, causing audio output to be incorrectly handled.
Add the import: from vllm_omni.diffusion.models.interface import SupportAudioOutput
And update the class declaration to: class LTX2Pipeline(nn.Module, SupportAudioOutput):
Then add: support_audio_output = True as a class variable.
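The engine-side check described above can be illustrated with a minimal sketch. Here `SupportAudioOutput` is a simplified stand-in for the real protocol in `vllm_omni.diffusion.models.interface`, and the helper only mirrors the spirit of the `supports_audio_output()` check in `diffusion_engine.py`:

```python
# Simplified stand-in for the SupportAudioOutput protocol (assumed from the
# review comment; the real interface may differ).
class SupportAudioOutput:
    support_audio_output: bool = False


class LTX2Pipeline(SupportAudioOutput):
    # The real class also inherits nn.Module; omitted here to stay self-contained.
    support_audio_output = True


def supports_audio_output(pipeline_cls: type) -> bool:
    # Mirrors (in spirit) the engine-side capability check.
    return getattr(pipeline_cls, "support_audio_output", False)
```

With the class variable declared, the capability check resolves to True for the pipeline and any subclass such as the image-to-video variant.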
```python
width,
prompt_embeds=None,
negative_prompt_embeds=None,
prompt_attention_mask=None,
negative_prompt_attention_mask=None,
```
The overridden method's signature does not match this call: it is passed too many arguments, including arguments named 'image' and 'latents'. The overriding method LTX2ImageToVideoPipeline.check_inputs matches the call.
```diff
 width,
+image=None,
+latents=None,
 prompt_embeds=None,
 negative_prompt_embeds=None,
 prompt_attention_mask=None,
 negative_prompt_attention_mask=None,
+**kwargs,
```
```python
dtype: torch.dtype | None = None,
device: torch.device | None = None,
generator: torch.Generator | None = None,
latents: torch.Tensor | None = None,
```
The overridden method's signature does not match this call, which passes too many arguments. The overriding method LTX2ImageToVideoPipeline.prepare_latents matches the call.
```diff
 latents: torch.Tensor | None = None,
+*args: Any,
+**kwargs: Any,
```
```python
def check_inputs(
    self,
    image,
    height,
    width,
    prompt,
    latents=None,
    prompt_embeds=None,
    negative_prompt_embeds=None,
    prompt_attention_mask=None,
    negative_prompt_attention_mask=None,
):
```
This method requires at least 5 positional arguments, whereas the overridden LTX2Pipeline.check_inputs may be called with 4; such a call would reach the base method correctly but would not match the overriding method's signature.
```python
except Exception:
    pass
```
'except' clause does nothing but pass and there is no explanatory comment.
```diff
-except Exception:
-    pass
+except Exception as exc:  # noqa: BLE001
+    # If ring-parallel utilities are unavailable or misconfigured,
+    # fall back to using the unsharded attention_mask.
+    logger.debug(
+        "Failed to shard attention mask for sequence parallelism; "
+        "continuing without sharding: %s",
+        exc,
+    )
```
Update this model's name in docs/models/supported_models.md, and if acceleration methods are applicable, also update it in docs/user_guide/diffusion/diffusion_acceleration.md and docs/user_guide/diffusion/parallelism_acceleration.md.
- `--vae_use_slicing`: Enable VAE slicing for memory optimization.
- `--vae_use_tiling`: Enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--tensor_parallel_size`: tensor parallel size (effective for models that support TP, e.g. LTX2).
How about the other inference examples?
I will rebase the code and add online video serving support after the holiday (Feb 24).

@david6666666 Hey, the LTX-2 T2V/I2V support looks solid with the transformer, pipeline, and scheduler all ported from the diffusers PR. Are you still testing this? Any issues with the video generation quality or the 17-file rebase against current main?
Signed-off-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Signed-off-by: David Chen <530634352@qq.com>
lishunyang12
left a comment
Left a few comments. The core model port looks thorough. Main concerns are around duplicated code and a variable shadowing issue in the engine.
```python
)


def _unwrap_request_tensor(value: Any) -> Any:
```
_unwrap_request_tensor and _get_prompt_field are duplicated verbatim from pipeline_ltx2.py. Since this file already imports from .pipeline_ltx2, just import these too instead of redefining them.
```python
output_idx = end_idx

if supports_audio_output(self.od_config.model_class_name):
    audio_payload = request_outputs[0] if len(request_outputs) == 1 else request_outputs
```
audio_payload is set at function scope from the dict output, then re-assigned inside the loop when supports_audio_output() is true. This shadowing is fragile -- use a different variable name for the per-request audio (e.g. request_audio_payload) to keep the two sources distinct.
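The suggested renaming can be sketched as follows; the function and field names are illustrative, not the engine's actual code:

```python
def extract_payloads(request_outputs, supports_audio):
    """Illustrative routing: keep per-request audio under its own name
    (request_audio_payload) instead of re-binding a function-scope
    audio_payload, so the two sources stay distinct."""
    results = []
    for out in request_outputs:
        request_audio_payload = out.get("audio") if supports_audio else None
        results.append({"video": out.get("video"), "audio": request_audio_payload})
    return results
```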
```python
sample (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
    The hidden states output conditioned on the `encoder_hidden_states` input, representing the visual output
    of the model. This is typically a video (spatiotemporal) output.
audio_sample (`torch.Tensor` of shape `(batch_size, TODO)`):
```
Nit: docstring says audio_sample shape but the field name and usage suggest this describes the output format. Verify the shape description matches the actual output dimensions.
```python
# LTX2 blocks return (hidden_states, audio_hidden_states)
forward_pattern=ForwardPattern.Pattern_0,
# Treat audio_hidden_states as encoder_hidden_states in Pattern_0
check_forward_pattern=False,
```
check_forward_pattern=False with ForwardPattern.Pattern_0 -- does Pattern_0 handle the dual-tensor return (hidden_states, audio_hidden_states) from LTX2 blocks correctly, or does cache-dit only cache the first element? Worth a comment explaining what happens to the audio branch during cached steps.
hsliuustc0106
left a comment
PR #841 Review: [Model] Support LTX-2 text-to-video image-to-video
Overview
This PR adds support for LTX-2, a text-to-video and image-to-video model from Lightricks that generates both video and audio. It includes SP, TP, CFG parallel support, and Cache-DiT optimization.
Features Supported ✅
| Feature | Status |
|---|---|
| Text-to-Video (T2V) | ✅ |
| Image-to-Video (I2V) | ✅ |
| Audio joint generation | ✅ |
| Sequence Parallel (SP) | ✅ |
| Tensor Parallel (TP) | ✅ |
| CFG Parallel | ✅ |
| Cache-DiT | ✅ |
Performance Results ✅
A100-80G (height=256, width=384):
| Config | Time | Improvement |
|---|---|---|
| Base | 39s | - |
| Cache-DiT | 26s | 33% faster |
| CFG 2 | 29s | 26% faster |
MRO Pattern Check ✅
No MRO issues detected. Classes follow proper inheritance order (nn.Module first).
Important Issues: 2 found
1. No Unit Tests for LTX2 Model
With 4480 lines of new code, having tests for the core transformer and pipeline functionality would be valuable.
2. Hardcoded Audio Sample Rate in Serving
audio_sample_rate = 24000 is hardcoded. Should come from vocoder config.
Suggestions
- Address the TODO comment at `ltx2_transformer.py:1198`
- Consider consolidating the `fps` and `frame_rate` fields in `inputs/data.py`
Strengths
- ✅ Comprehensive T2V/I2V implementation with audio support
- ✅ Good performance optimizations (33% faster with Cache-DiT)
- ✅ Proper parallelism support (SP, TP, CFG)
- ✅ Clean architecture with clear class hierarchy
- ✅ Documentation updated
Recommendation
Add basic unit tests and fix hardcoded audio sample rate, then ready for merge.
```python
result = await self._run_generation(prompt, gen_params, request_id, raw_request)
videos = self._extract_video_outputs(result)
audios = self._extract_audio_outputs(result, expected_count=len(videos))
```
Hardcoded audio sample rate
Consider getting the sample rate from the vocoder config instead of hardcoding:

```python
# Instead of:
audio_sample_rate = 24000
# Use:
audio_sample_rate = self.engine.model.vocoder.config.output_sampling_rate
```

This ensures consistency if the model uses a different sample rate.
```python
freqs = freqs.transpose(-1, -2).flatten(2)  # [B, num_patches, self.dim // 2]

# 5. Get real, interleaved (cos, sin) frequencies, padded to self.dim
# TODO: consider implementing this as a utility and reuse in `connectors.py`.
```
TODO comment
Consider addressing this before merge or creating a tracking issue for the utility refactoring.
```python
height: int | None = None
width: int | None = None
fps: int | None = None
frame_rate: float | None = None
```
Duplicate fields: fps vs frame_rate
Both fps: int | None and frame_rate: float | None exist. Consider consolidating these to avoid confusion, or document why both are needed (e.g., fps for video encoding, frame_rate for model inference).
Purpose
Support LTX-2 text-to-video and image-to-video; see huggingface/diffusers#12915.
Test Plan
t2v:
diffusers:
i2v:
diffusers:
online serving:
Test Result
t2v:
ltx2_t2v_diff.mp4
i2v:
ltx2_i2v_diff.mp4
A100-80G, height=256, width=384:
- cache-dit: 39s -> 26s
- ulysses_degree 2: 39s -> 38s
- ring_degree 2: 39s -> 38s
- cfg 2: 39s -> 29s
- tp 2: 39s -> 38s
Checklist
LTX-2
LTX-2 follow-up PRs:
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.